Consolidated Segmentation and Churn Analysis of Bank Clients

By: Tamjid Ahsan


Capstone project for the Flatiron Data Science Bootcamp.

ABSTRACT


Attracting new customers is no longer a good strategy for mature businesses since the cost of retaining existing customers is much lower. For this reason, customer churn management becomes instrumental for any service industry.

This analysis combines churn prediction and customer segmentation, and aims to produce an integrated customer analytics outline for churn management. The analysis has six components: data pre-processing, exploratory data analysis, customer segmentation, customer characteristics analytics, churn prediction, and factor analysis. It adapts the OESMiN framework for data science.

Customer data from a bank is used for this analysis. After preprocessing and exploratory data analysis, customer segmentation is carried out using K-means clustering. A Random Forest model, optimized for F1 score, is used to validate the clustering and obtain feature importance. With this model, customers are segmented into different groups, which enables marketers and decision makers to target existing customer retention strategies more precisely. Different machine learning models are then trained on the preprocessed data together with the segment prediction from the K-means clustering model. These churn models are optimized for precision. To address class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) is applied to the training set. For factor analysis, the feature importance of the models is used. Based on cluster characteristics, clients are labeled as Low value frequent users of services, High risk clients, Regular clients, Most loyal clients, and High value clients. Final model accuracy is 0.97, with precision for predicting churn at around 0.93.

OVERVIEW




Customer churn is a major issue that occurs when consumers abandon your products and go to another provider. Because of the direct impact on profit margins, firms are now focusing on identifying consumers who are at risk of churning and keeping them through tailored promotional offers. Customer churn analysis and customer turnover rates are frequently used as essential business indicators by banks, insurance firms, streaming service providers, and telecommunications service providers, since the cost of maintaining existing customers is significantly less than the cost of obtaining a new one.

When it comes to customers, the financial crisis of 2008 changed the banking sector's strategy. Prior to the financial crisis, banks were mostly focused on acquiring more and more clients. However, once the market crashed, banks quickly realized that the expense of attracting new clients is several times higher than that of retaining existing ones, which means losing clients can be financially damaging. Fast forward to today, and the global banking sector has a market capitalization of $7.6 trillion, with technology and regulation making it easier than ever to transfer assets and money between institutions. Furthermore, this has given rise to new forms of competition for banks, such as open banking, neo-banks, and fin-tech businesses (Banking as a Service (BaaS))[1]. Overall, today's consumers have more options than ever before, making it easier than ever to switch or leave banks altogether. According to studies, repeat customers are likely to spend 67 percent more on a bank's products and services, which emphasizes the necessity of knowing why clients churn and how churn varies across different characteristics. Banking is one of those conventional sectors that has undergone continuous development throughout the years. Nonetheless, many banks with a sizable client base that hope to gain a competitive advantage have not tapped into the huge amounts of data they hold, particularly for tackling one of the most well-known challenges, customer turnover.

Churn can be expressed as a level of customer inactivity or disengagement seen over a specific period. This shows up in the data in a variety of ways, e.g., frequent balance transfers to another account or an unusual drop in average balance over time. But how can anyone look for churn indicators? Collecting detailed feedback on the customer's experience might be difficult. For one thing, surveys are both rare and costly. Furthermore, not all clients receive them or bother to reply. So, where else can you look for indicators of future client dissatisfaction? The solution lies in identifying early warning indicators from existing data. Advanced machine learning and data science techniques can learn from previous customer behavior and external events that lead to churn, and use this knowledge to anticipate the possibility of a churn-like event in the future.


Ref:

[1] Business Insider

[2] Stock images from PEXELS

BUSINESS PROBLEM



While everyone recognizes the importance of maintaining existing customers and therefore improving their lifetime value, there is very little banks can do about customer churn when they don't anticipate it coming in the first place. Predicting attrition becomes critical in this situation, especially when unambiguous consumer feedback is lacking. Precise prediction enables advertisers and client experience groups to be imaginative and proactive in their offering to the client.

XYZ Bank (read: fictional) is a mature financial institution based in Eastern North America. Recent advances in technology and the rise of BaaS are a real threat to the bank, as these competitors can lure away its existing clientele. The bank has existing data on its clients. Based on the data available, the bank wants to know which of them are at risk of churning.

This analysis focuses on the behavior of bank clients who are more likely to leave the bank (i.e. close their bank account, churn).

IMPORTS

OBTAIN

The data for this analysis is obtained from Kaggle, from the dataset titled "Credit Card customers" uploaded by Sakshi Goyal. This dataset was originally obtained from LEAPS Analyttica. A copy of the data is in this repository at /data/BankChurners.csv.

This dataset contains data of more than 10000 credit card accounts with around 19 variables of different types as of a time point and their attrition indicator over the next 6 months.

Data description is as below:

| Variable | Type | Description |
|---|---|---|
| Clientnum | Num | Client number. Unique identifier for the customer holding the account |
| Attrition_Flag | obj | Internal event (customer activity) variable - if the account is closed then 1 else 0 |
| Customer_Age | Num | Demographic variable - Customer's Age in Years |
| Gender | obj | Demographic variable - M=Male, F=Female |
| Dependent_count | Num | Demographic variable - Number of dependents |
| Education_Level | obj | Demographic variable - Educational Qualification of the account holder (example: high school, college graduate, etc.) |
| Marital_Status | obj | Demographic variable - Married, Single, Divorced, Unknown |
| Income_Category | obj | Demographic variable - Annual Income Category of the account holder (< $40K, $40K - 60K, $60K - $80K, $80K-$120K, > $120K, Unknown) |
| Card_Category | obj | Product Variable - Type of Card (Blue, Silver, Gold, Platinum) |
| Months_on_book | Num | Months on book (Time of Relationship) |
| Total_Relationship_Count | Num | Total no. of products held by the customer |
| Months_Inactive_12_mon | Num | No. of months inactive in the last 12 months |
| Contacts_Count_12_mon | Num | No. of Contacts in the last 12 months |
| Credit_Limit | Num | Credit Limit on the Credit Card |
| Total_Revolving_Bal | Num | Total Revolving Balance on the Credit Card |
| Avg_Open_To_Buy | Num | Open to Buy Credit Line (Average of last 12 months) |
| Total_Amt_Chng_Q4_Q1 | Num | Change in Transaction Amount (Q4 over Q1) |
| Total_Trans_Amt | Num | Total Transaction Amount (Last 12 months) |
| Total_Trans_Ct | Num | Total Transaction Count (Last 12 months) |
| Total_Ct_Chng_Q4_Q1 | Num | Change in Transaction Count (Q4 over Q1) |
| Avg_Utilization_Ratio | Num | Average Card Utilization Ratio |

There is an "Unknown" category in Education Level, Marital Status, and Income Category. Imputing values for those features does not make sense, and it is understandable why they are unknown in the first place. Information about education and marital status is often complicated and confidential, and customers are reluctant to share it. The same goes for income level. It is best for the model to be able to handle cases where this information is not available and still produce a prediction.

For this reason, these values are not imputed in any way for this analysis.

A major class imbalance is spotted in the target column.
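A quick way to quantify the imbalance is to look at the share of each class in the target column. The snippet below is a minimal sketch on toy data; the label strings "Existing Customer" and "Attrited Customer" and the 84/16 split are assumptions used for illustration, not values read from the repository's CSV.

```python
import pandas as pd

# Toy stand-in for the Attrition_Flag column (label strings are assumed).
flags = pd.Series(
    ["Existing Customer"] * 84 + ["Attrited Customer"] * 16,
    name="Attrition_Flag",
)

# Share of each class; a heavily skewed split signals class imbalance.
class_share = flags.value_counts(normalize=True)
churn_rate = class_share["Attrited Customer"]

print(class_share)
print(f"Churn rate: {churn_rate:.0%}")
```

On the real data the same two lines would run on the `Attrition_Flag` column of the loaded DataFrame.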

There are no null values to deal with, and features have the correct data types. No unexpected category is spotted beyond the "Unknown" level noted above, and the summary statistics do not warrant any closer inspection.

EDA

In this dataset, around 16% of clients have ended their relationship with the bank.

| Category | Observation |
|---|---|
| Marital Status | Being married or single has little impact on churning |
| Card Category | The Blue category severely outweighs the other card categories |
| Gender | Slightly more female clients than male; overall almost the same churning possibility |
| Education Level | Most of the clients of the bank are graduates; given the size of each class, the churn rate is very similar |
| Income Category | Most of the clients earn less than 40K |

Most of the numeric features are not normally distributed. Logistic regression might not be the best-performing model for this analysis, as it fails to meet the linearity assumption most of the time. Feature transformation would be required to use it well.


| Feature | Observation |
|---|---|
| Customer Age | Normal distribution for age |
| Dependent count | Ordinal variable ranging from one to five |
| Months on book | Almost normal distribution, except for a huge spike at the 36-month point and a gap at every 6-month interval |
| Total Relationship Count | Ordinal variable; the majority of clients have 3 or more relationships |
| Months_Inactive_12_mon | Most customers don’t stay inactive for more than 3 months |
| Contacts_Count_12_mon | Ordinal variable; most values are 2 and 3 |
| Credit Limit | Almost log-normal distribution; the maximum credit limit offered is 35k |
| Total Revolving Bal | Ignoring a spike at 0, the distribution is almost normal, with a fat tail at the right end |
| Avg Open To Buy | Log-normal distribution |
| Total_Amt_Chng_Q4_Q1 | Normal distribution with a skinny long tail towards the right |
| Total Trans Amt | There seem to be four normal distributions here; this can be a strong deciding feature for use in segmentation |
| Total Trans Ct | Normal distribution with a skinny long tail towards the right |
| Total_Ct_Chng_Q4_Q1 | Well-behaved distribution, but far from normal |
| Avg Utilization Ratio | Log-normal distribution; very few people use their total credit limit, which is expected |

There is no clear pattern spotted. Every client age group is similarly likely to churn.

Clients with a lower credit limit utilization ratio are more likely to churn. They have a less steep regression line. Also, clients with a lower credit limit and high utilization have a higher risk of churning.

Clients who are inactive for 3 to 4 months have a higher risk of churning.

SCRUB

As spotted before, class imbalance in the target column will be addressed by synthetic oversampling later in this section.

Label encoding

Train-Test split

Encoding & Scaling

Pipeline
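One possible arrangement of the split, encoding, and scaling steps is sketched below with scikit-learn. This is an illustration under assumptions, not the notebook's actual pipeline: the data is randomly generated, only four of the columns from the data description are used, and the 75/25 split and `random_state` values are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Randomly generated stand-in data (assumption); column names follow the data description.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Customer_Age": rng.integers(26, 74, n),
    "Credit_Limit": rng.uniform(1500, 35000, n),
    "Gender": rng.choice(["M", "F"], n),
    "Card_Category": rng.choice(["Blue", "Silver", "Gold"], n, p=[0.9, 0.07, 0.03]),
})
y = pd.Series((rng.random(n) < 0.16).astype(int), name="churn")

# Stratify so that train and test keep roughly the same churn share.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scale numeric columns and one-hot encode categorical ones.
numeric = ["Customer_Age", "Credit_Limit"]
categorical = ["Gender", "Card_Category"]
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training split only, then reuse the fitted transformer on the test split.
X_train_prep = preprocess.fit_transform(X_train)
X_test_prep = preprocess.transform(X_test)
```

Fitting the transformer on the training split only keeps information from the test split out of the scaling statistics.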

SMOTENC

The training set is oversampled to around 13K samples for training the prediction model.
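The core idea behind SMOTE-style oversampling can be sketched in a few lines of NumPy: each synthetic row is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. The function below, `smote_sketch`, is a hypothetical helper written for illustration. It handles numeric features only, whereas the SMOTENC variant named above also handles categorical columns, so it should not be read as the implementation used in this analysis.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), n_new)             # random minority rows
    picked = neighbours[base, rng.integers(0, k, n_new)]  # one neighbour per base row
    gap = rng.random((n_new, 1))                          # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: 20 rows, 3 numeric features (assumed data).
rng = np.random.default_rng(1)
X_minority = rng.normal(size=(20, 3))
X_synthetic = smote_sketch(X_minority, n_new=80)
print(X_synthetic.shape)
```

Because every synthetic row is an interpolation between two real minority rows, it always stays inside the range of the observed minority data.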

MODEL

Client Segmentation

For segmentation modeling independent variables are not oversampled.

Different versions of the dataset were tried for modeling.

Performance was mostly the same across them. The data preparation steps for the segmentation model are as follows.

Finding "K"

Several K-means models were fitted to deduce the optimal number of segments. The number of clusters ranged from 1 to 20.

A higher Silhouette Coefficient indicates a model with better-defined clusters, and the same holds for a higher Calinski-Harabasz score.

Although no obvious optimal K can be spotted from the visuals, 5 segments seemed optimal for the initial model based on the Silhouette Score and the sum of squared errors (a.k.a. the elbow plot). The Calinski-Harabasz score also supports this segmentation.

Customers are segmented into 5 groups by their characteristics.

Among models run for K in the range 2 to 10, K=5 is recommended by the yellowbrick package.
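The scan over K described above can be sketched as follows. This is a minimal illustration on synthetic blobs with four well-separated groups, not the bank data; the range of K, the blob centers, and the `random_state` values are assumptions. The scan starts at K=2 because the silhouette score is undefined for a single cluster.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic data with four well-separated groups (assumption for illustration).
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = {
        "sse": km.inertia_,  # sum of squared errors, used for the elbow plot
        "silhouette": silhouette_score(X, km.labels_),
        "calinski_harabasz": calinski_harabasz_score(X, km.labels_),
    }

# Pick the K with the highest silhouette score.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
print(best_k)
```

On real data the three metrics rarely agree this cleanly, which is why the choice of K above is cross-checked against several of them.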

Mean shift clustering aims to discover “blobs” in a smooth density of samples. It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids. (From scikit learn documentation)

Suggestion of MeanShift supports the initial choice of K=5.
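As a rough illustration of how MeanShift suggests a number of clusters without being told K, the sketch below runs scikit-learn's `MeanShift` on three synthetic blobs. The data and the hand-picked `bandwidth=4.0` are assumptions chosen for the toy example; on real data the bandwidth is usually estimated, for instance with `sklearn.cluster.estimate_bandwidth`.

```python
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# Three well-separated synthetic groups (assumption for illustration).
centers = [(-10, 0), (0, 10), (10, 0)]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

# The bandwidth sets the size of the region whose mean each candidate centroid moves to.
ms = MeanShift(bandwidth=4.0).fit(X)

# Near-duplicate centroids are merged, so the count of centers is the suggested K.
n_clusters = len(ms.cluster_centers_)
print(n_clusters)
```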

Selecting "K"

Segmentation is not immediately apparent in this visualization. More insights on the segmentation are in the INTERPRET part of this analysis. PCA is used to explore the segmentation.

Principal component analysis is used to reduce the features so that the clusters can be visualized in a three-dimensional space.

Even though PCA explains only forty percent of the variance of the entire dataset, the clusters exhibit a clear separation in a three-dimensional space. There is also a clear separation between clusters in a two-dimensional space. The selected K of 5 is therefore kept. It is evaluated further during the inter-cluster exploration in a later part.
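The explained-variance figure quoted above comes from the PCA fit itself. A minimal sketch of reducing scaled features to three components and reading off the explained share is shown below; the random data with one correlated column pair is an assumption, so the printed share will not match the forty percent reported for the bank data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in features with one correlated pair of columns (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X[:, 1] += 2 * X[:, 0]

# PCA is scale-sensitive, so standardize before projecting.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=3, random_state=0)
X_3d = pca.fit_transform(X_scaled)  # coordinates for a 3D scatter plot

# Share of total variance retained by the three components.
explained = pca.explained_variance_ratio_.sum()
print(f"Explained variance: {explained:.0%}")
```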

Clustering Feature importance

The newly created cluster_df is used to find out which features were used most often in determining the segmentation. A Random Forest model is fitted to obtain feature importance, alongside a permutation importance analysis, to identify the most important features.
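The approach can be sketched as follows: cluster labels become the target of a classifier, and both impurity-based and permutation importance are read from it. The data here is synthetic, with two informative columns and two pure-noise columns, and all names and hyperparameters are assumptions rather than the notebook's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Columns 0-1 carry the group structure; columns 2-3 are pure noise (assumed data).
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10), (0, 0)]
X_inf, _ = make_blobs(n_samples=600, centers=centers, cluster_std=1.0, random_state=0)
rng = np.random.default_rng(0)
X = np.hstack([X_inf, rng.normal(size=(600, 2))])

# Segment labels from K-means become the classification target.
segments = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

X_train, X_test, seg_train, seg_test = train_test_split(
    X, segments, stratify=segments, random_state=0
)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, seg_train)

# Impurity-based importance from the fitted forest.
impurity_importance = forest.feature_importances_

# Permutation importance: drop in held-out score when one column is shuffled.
perm = permutation_importance(forest, X_test, seg_test, n_repeats=10, random_state=0)
permutation_mean = perm.importances_mean

print(impurity_importance.round(3), permutation_mean.round(3))
```

Computing permutation importance on held-out data is a useful cross-check, since impurity-based importance is measured on the training data only.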

Model fit is good, with good performance metrics and no sign of overfitting. Prediction precision is over .90 for most of the classes except cluster 3, whose precision varies from run to run depending on the train-test split but is still close to .90 most of the time.

By looking at the above chart, these 10 features are selected as the most important features. Those will be explored in the later part of the notebook.

Segmentation Characteristics

intra cluster EDA

Exploration of clusters with an interactive plot.

inter cluster EDA

Exploring features among clusters based on the insights from the feature importance from the previous part of the analysis. Only the most important features decided at the previous part are explored.

Cluster Distribution

Cluster 0 has the fewest members. Clusters 1 and 2 are fairly similar in size. Clusters 3 and 4 have a moderate number of members.

Customer Age

Clusters 4 and 1 have similar distributions. Cluster 0 is younger. Cluster 3 is distinct, as it mostly comprises older clients. The others have similar distributions.

Credit Limit
Avg Utilization Ratio
Months on book

All of them show a similar spread except Cluster 3, whose members are the most loyal clients.

Total_Trans_Amt

Cluster 0 has the highest transaction amount. The rest have a similar pattern.

Avg_Open_To_Buy
Total_Trans_Ct
Total_Revolving_Bal
Total_Relationship_Count

Cluster 0 mostly comprises clients with a lower relationship count. The rest of the clusters have similar distributions.

Dependent_count

All of them are mostly similar.

with churn info

All the features are explored with respect to churning.

The clusters are summarized by their most important features. This is done by interpreting the results and taking notes to create a summary table. All the intra-cluster and inter-cluster plots are considered for this. Microsoft Excel is used for this purpose.


Variable Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Churn Comment Description
Avg_Open_To_Buy spread low low low high value 1 Majority values are low Open to Buy Credit Line (Average of last 12 months)
Avg_Utilization_Ratio low utilization minimal low utilization no low utilization ratio med utilization low utilization 1 Majority values are low Average Card Utilization Ratio
Card_Category 1 High class imbalance to comment Type of Card (Blue, Silver, Gold, Platinum)
Contacts_Count_12_mon 1 3 No. of Contacts in the last 12 months
Credit_Limit all clients from2k mostly low limit 2k to 4k, no high limit high limit, above 14k 1 Credit Limit on the Credit Card
Customer_Age similar similar similar older similar 3 Customer's Age in Years
Dependent_count spread spread spread low spread 1 count 3 and 4 is risky Number of dependents
Education_Level Graduate Graduate College College Uneducated 1 Graduates >HS >= Unknown>=Uneducated, PG and PhD is less likely Educational Qualification of the account holder (example: high school, college graduate, etc.)
Gender M F F F M 1 Females is risky M=Male, F=Female
Income_Category Less_than_40K 40K_to_60K 40K_to_60K Less_than_40K Unknown 1 Less than 40K Annual Income Category of the account holder (< $40K, $40K - 60K, $60K - $80K, $80K-$120K, > $120K, Unknown)
Marital_Status Unknown Single Married Married Unknown 1 Majority values is Married Married, Single, Unknown , Divorced
Months_Inactive_12_mon 1 3 No. of months inactive in the last 12 months
Months_on_book good similar similar loyal customer similar 3 Time of Relationship
Total_Amt_Chng_Q4_Q1 1 High frequency if transaction Change in Transaction Amount (Q4 over Q1)
Total_Ct_Chng_Q4_Q1 1 Change in Transaction Count (Q4 over Q1)
Total_Relationship_Count low high high high high 1 2 and 3 are most frequent Total no. of products held by the customer
Total_Revolving_Bal spread low mod spread spread 1 Majority values are low Total Revolving Balance on the Credit Card
Total_Trans_Amt High transaction amount low mid amount till 5k high feq transaction mid amount till 5k high feq transaction mid amount till 5k med feq transaction 1 low amounts Total Transaction Amount (Last 12 months)
Total_Trans_Ct heavy user moderate user moderate user moderate user moderate user 1 Majority values are between 30 to 50 Total Transaction Count (Last 12 months)

Churn Prediction

The prediction from the clustering model is used as a feature for the churn prediction model. Models without this feature were also tried; they performed slightly worse. For the final modeling approach, the dataset containing predictions from the K-means model is used.

Baseline model

The baseline model performs on par with random chance, i.e. flipping a coin for each prediction.
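A coin-flip baseline of this kind can be reproduced with scikit-learn's `DummyClassifier`. The sketch below assumes the `"uniform"` strategy and synthetic labels with roughly 16% positives; which baseline strategy the notebook actually used is not stated here.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score

# Synthetic features and labels with about 16% churners (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = (rng.random(2000) < 0.16).astype(int)

# "uniform" predicts each class with equal probability, ignoring the features.
baseline = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)
y_pred = baseline.predict(X)

acc = accuracy_score(y, y_pred)
prec = precision_score(y, y_pred)
print(f"accuracy={acc:.2f}, churn precision={prec:.2f}")
```

Any real model should clear both numbers comfortably; churn precision is the harder one to beat on imbalanced data.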

Logistic Regression

'Avg_Open_To_Buy' with 'Credit_Limit', 'Card_Category_Silver' with 'Card_Category_Blue', 'Gender_M' with 'Gender_F', 'Months_on_book' with 'Customer_Age', and 'Total_Trans_Ct' with 'Total_Trans_Amt' show high multicollinearity. This is expected given the nature of those features.

Multicollinearity undermines the statistical significance of an independent variable. It is important to point out, however, that multicollinearity does not affect the model's predictive accuracy, so this issue is not addressed at this stage.

The model is not good enough at predicting target class 1 (churned customers), although accuracy is good.

The accuracy is good enough, but the F1 and precision values indicate large errors on the churn class, which supports the previous point about model performance. Outlier removal would be the natural next step. It is not pursued here because both the IQR and Z-score based approaches would cause very high data loss, as the numeric features contain many recurring values (lots of zeros).

Critical features for churning:

Odds ratios are used to measure the relative odds of the occurrence of the outcome, given a factor of interest [Bland JM, Altman DG.(2000), The odds ratio]. The odds ratio is used to determine whether a particular attribute is a risk factor or protective factor for a particular class and the magnitude of percentage effect is used to compare the various risk factors for that class. The positive percentage effect means that the factor is positively correlated with churn and vice versa.

The odds ratio and percentage effect of each feature are estimated as $\mathbf{OddsRatio} = e^{\Theta }$ and $\mathbf{Effect (\%)} = 100 * (OddsRatio - 1)$, where $\Theta$ is the weight of the feature in the Logistic Regression model. If the effect is positive, the greater the factor, the more likely the client is to churn; such factors are considered risk factors. If the effect is negative, the greater the factor, the more likely the customer is to stay; such factors can be considered protective factors. This provides a coefficient-based way of identifying feature importance.
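As a worked example of the two formulas, the snippet below converts a pair of logistic regression weights into odds ratios and percentage effects. The weights are hypothetical values chosen for illustration, not coefficients from the fitted model; only the feature names come from the data description.

```python
import numpy as np

# Feature names from the data description; the weights are hypothetical.
feature_names = ["Months_Inactive_12_mon", "Total_Trans_Ct"]
theta = np.array([0.5, -0.8])

odds_ratio = np.exp(theta)            # OddsRatio = e^Theta
effect_pct = 100 * (odds_ratio - 1)   # Effect (%) = 100 * (OddsRatio - 1)

for name, ratio, effect in zip(feature_names, odds_ratio, effect_pct):
    kind = "risk factor" if effect > 0 else "protective factor"
    print(f"{name}: odds ratio {ratio:.2f}, effect {effect:+.1f}% ({kind})")
```

With these assumed weights, the first feature raises the odds of churn by roughly 65% per unit increase, while the second lowers them by roughly 55%.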

Greater risk factors are Customer_Age, Credit_Limit, Avg_Open_To_Buy, Contacts_Count_12_mon, Months_Inactive_12_mon. Cluster 1 is the most likely to churn.

Random Forest

Original data

Oversampled data

Grid search did not find a better model. Precision for target class 0 is worse than in the previous model.

XGBoost

XGBClassifier

The model is not overfitting. It has good test accuracy and the highest precision for target class 1, which represents churning. (Numbers vary slightly between runs.)

Model performance is mostly unchanged even after an extensive (and expensive in terms of runtime) grid search.

XGBRFClassifier

Significantly worse performance for predicting target class than previous model.

Best model

The XGBClassifier model is deemed the best model type for predicting churn. It shows the best fit and model performance. Here is the model report for that model.

Additional interpretation with insights can be found in the INTERPRET section of the analysis.

INTERPRET

Customer Segmentation model

Based on the analysis from the segmentation part and the exploration of the clusters, they can be identified as follows:

NOTE: labels can change on different runs.

Churn Prediction model

SHAP values are used to explain this model. SHAP (SHapley Additive exPlanations) is a game-theoretic approach to explain the output of any machine learning model.

Features are sorted by the sum of SHAP value magnitudes over all samples. It also shows the distribution of the impacts each feature has. The color represents the feature value:

Here high represents category 1 (Client Churn).

| Feature | Observation |
|---|---|
| Total_Trans_Ct | Low value means higher risk of churning |
| Total_Trans_Amt | Above average value means higher risk of churning |
| Total_Revolving_Bal | Low value means higher risk of churning |
| Total_Relationship_Count | More relationships indicate more chance of churning |
| Total_Amt_Chng_Q4_Q1 | Low value means higher risk of churning |
| Total_Ct_Chng_Q4_Q1 | Low value means higher risk of churning |
| Months_Inactive_12_mon | Higher value means higher risk of churning |
| Contacts_Count_12_mon | Higher value means higher risk of churning |

And so on.


If a client has a low transaction count and a high transaction value with a low revolving balance, while being inactive for a long time, there is a higher chance that they will churn. Less interaction with the bank can turn into attrition. These attributes can be considered warning signs.

RECOMMENDATION & CONCLUSION

Cluster 1 is the riskiest client segment. These clients should be offered deals to make them stay with the bank.

As a rule of thumb:

This churn prediction model can help marketers identify clients with a higher risk of churning. It lets them identify potential customers as well as customers who are on the verge of leaving for any reason. Simply identifying and reaching out to these clients can reduce customer dissatisfaction and retain a substantial portion of them.

NEXT STEPS

Modeling aspect: Gaussian Mixture Models for segmentation modeling, and Neural Network based approach for prediction model.

Business need aspect: A part of the business challenge is determining how soon you want the model to forecast. A prediction that is made too long in advance may be less accurate. A narrow prediction horizon, on the other hand, may perform better in terms of accuracy, but it may be too late to act after the consumer has made her decision.

Finally, it is critical to establish whether churn should be characterized at the product level (customers who are likely to discontinue using a certain product, such as a credit card) or at the relationship level (clients likely to leave the bank itself). When data is evaluated at the relationship level, you gain a wider insight into the customer's perspective. Excessive withdrawals from a savings account, for example, may be used to pay for a deposit on a house or education costs. Such insights into client life events are extremely effective not just for preventing churn, but also for cross-selling complementary items that may enhance the engagement even further. This can be done with more information about the customers, if product-level data is available.

APPENDIX

Environment setup

For running this locally please follow instructions from './assets/req/README.md'.

All functions and imports come from functions.py and packages.py.

Dashboard

Online

COMING SOON


Local

Run viz_dash.py for a dashboard with insights and prediction. Dashboard_jupyter.ipynb contains a JupyterDash version for running Dash inside a Jupyter notebook.
